\chapter{Design}\label{chap:design}
In this Chapter the two approach to the parallelization of Keccak will be presented. As stated in Chap. \ref{chap:introduction}, the solutions proposed try to reduce the time needed to the hash computation by addressing the problem in two diametrically opposed ways: the first solution, from now on referred to as "local parallelization", attempts to increase the time performance by means of a bunch of threads collaborating for the computation of a single hash; the second solution, from now on referred to as "global parallelization", concentrates on the simultaneous calculation of the hash of different messages, taking advantage from the consideration that usually in the real world the software for the computation of the hash is installed on machines that must serve thousand of different requests per second.\\
In the following sections the two approaches will be extensively presented.\\


\section{Local Parallelization}
The internal status of the Keccak permutation function is composed by 25 words of 64 bits in a grid 5x5. Since in CUDA the bit to bit operations are rather slow, we decided to use 25 thread, one for every word of the internal state. As described below, every single thread, during the computation, is responsible for calculating the value of a single cell of the matrix, identified by the positions X and Y of the state matrix that are the same of thread in the CUDA thread-grid.

\subsection{Chi}
In the implementation of chi we used a matrix 10x5 of 64 bits words, containing the internal state of the Hash function after the previous step, replicated two times. This because, in CUDA, the modulo operator is rather slow compared with cpu; using a 10x5 matrix the threads that need words out of the first 5 columns of the matrix can safety complete the operations. The operators NOT, XOR and AND have been used normally. The result is written in a new matrix that will be the new internal state.

\subsection{Theta}
In the implementation of Theta the internal state is duplicated in a matrix 5x10 for the same reason described for Chi. Every single thread calculate the C value of its own column so that it is repeated 5 times (one for every thread of the column). This procedure is aimed to avoid using IF-patterns that would break the parallelism between thread. The D matrix is calculated in the same way. For the computation of the ROT matrix we were forced to use shift operators that can decrement the performance. At the end every thread copy its result in the corresponding cell of a new matrix that will be the new internal state.

\subsection{Pi}
The Pi step is implemented using 2 matrix 5x5 with the coordinates X and Y of the new positions that the words will have after the permutation. Every thread read this coordinates and copy its state word in a new matrix, in the position read, that will be the new internal state.

\subsection{Rho}
Like in the Theta, in Rho we were forced to used shift operators to implement the bit word rotation; the offset of the rotation depends on the position of the word to rotate in the internal state and is loaded as constant in a 5x5 matrix. The single thread read the offset value and make the rotation of its own word and then copies the result in a new matrix.

\subsection{Iota}
The implementation of Iota is obviously the more simple since in Iota there is only a bit to bit xor between the first word of the internal state and a 64 bits constant different in every round. Those round constant are preloaded and the operators has been used normally.


\section{Global Parallelization}
The original Keccak structure have been almost completely maintained in this solution, even thought many adjustments have been made to maximize the performance on GPU. This optimization process required the main effort: the tuning of both the execution parameters and the compiler directives leads to the production of very different algorithms before the best configuration has been discovered.\\
Following a description of the base algorithm and a discussion of the most important design choices.\\

\subsection{Base Algorithm}
All the designed algorithms have a common base, showed in Alg. \ref{alg:cuda-keccak}
\begin{algorithm}                      % enter the algorithm environment
\caption{Calculate $y = x^n$}          % give the algorithm a caption
\label{alg:cuda-keccak}                % and a label for \ref{} commands later in the document
\begin{algorithmic}                    % enter the algorithmic environment
\REQUIRE $n \geq 0 \vee x \neq 0$
\ENSURE $y = x^n$
\STATE $y \Leftarrow 1$
\IF{$n < 0$}
\STATE $X \Leftarrow 1 / x$
\STATE $N \Leftarrow -n$
\ELSE
\STATE $X \Leftarrow x$
\STATE $N \Leftarrow n$
\ENDIF
\WHILE{$N \neq 0$}
\IF{$N$ is even}
\STATE $X \Leftarrow X \times X$
\STATE $N \Leftarrow N / 2$
\ELSE[$N$ is odd]
\STATE $y \Leftarrow y \times X$
\STATE $N \Leftarrow N - 1$
\ENDIF
\ENDWHILE
\end{algorithmic}
\end{algorithm}
The only difference between all the solutions designed is the kernel adopted for the computations. Three different kernels have been produced:\\
\begin{itemize}
\item \textbf{Kernel Base} Almost identical to the Keccak reference permutation function.
\item \textbf{Kernel Unrolled} All the loop have been unrolled in order to avoid thr continuous flush of the computation pipeline.
\item \textbf{Kernel SH} Local variable have been placed into shared memory to prevent the usage of the slow Local Memory.
\end{itemize}
The test performed showed that 'Kernel Unrolled' is the most effective. Further details on this in Chap. \ref{chap:conclusions}.

\subsection{Memory Transfers}
Memory transfers between Host and Device Memory are the main cause of performance lost in the CUDA applications.\\
In this work, from the logical point of view, there is a need for a single large data transfers from the Host to the Device: the messages by which to evaluate the hashes. The retrieval of the computed hashes can be considered negligible compared to the previous data flow. The most obvious solution to provide the data for the computations to the GPU is of course a single big data transfers from the Host, however, this idea has been rejected because judged unnecessarily onerous. This assertion emerges from the consideration that only one token of each message can be considered at a time, because of the serial nature of Keccak. For this reason the data flow has been partitioned into tokes and performed using a double buffer strategy: during the computations of the \textit{i th} token, the data needed for the \textit{i+1 th} one are loaded into a separate memory location. It has been verified that the latency of these memory transfers is completely hidden by the kernel execution.

\subsection{Loop Unrolling}
As mentioned in Chap. \ref{chap:cuda-overview}, no branch prediction strategy is implemented into CUDA devices. For this reason loops can be a source of trouble in CUDA applications, especially if they are many and consist of a few instructions, as in this case.\\
Considering that the nvcc compiler has often an unpredictable behaviour and that it is still an open highly variable project, the loops founded in the original code have been physically fully unrolled instead of using the compiler unroll directive. The results obtained by this operation were satisfying and have influenced the choice of the unrolled kernel as reference version for the project.

\subsection{Registers Usage}
Registers usage is a sore point for all the CUDA applications. These are the default location of local variables and contain temporary results of arithmetic operations.\\
In order to maximize the occupancy, and consequently the performance, the number of used registers per thread must be kept under very limited thresholds, depending on the compute capability of the device. On the other hand, registers are the fastest memory locations available in the GPU and for this reason their usage is strongly encouraged, still for the performance implications.\\
The nvcc compiler by default attempt to maximize the registers usage. Unfortunately, when the total amount of memory requested by a thread exceed the registers availability, the compiler can decide to place local variables and temporary result into a special part of the Global Memory, the so called Local Memory. This locations are private and have the scope of a single threads like registers, but unlike registers these are located off-chip and have the same high latency of the Global Memory. For this reason the usage of Local Memory must be avoided.\\
This work, by its own nature, is highly registers intensive and for this reason suffers from limited availability of fast memory: a big part of the local variables of the threads are allocated into Local Memory. In order to reduce the impact on performance, a special kernel that uses the shared memory as local variables location has been designed.

\subsection{Left Shift}
A well-known nvcc bug has been a source of problems in the early stages of development. The bug, reported to the nvidia community a long time ago but not yet corrected, concerns to the inability of performing bitwise left shift when the shift factor is not stored into a constant variable. As suggested by nvidia itself, in order to bypass the problem the shift factor have been copied into a constant variable immediately before the shift operations.
